This project relies on data from New York City’s Taxi and Limousine Commission (TLC). NYC publishes TLC data for all trips taken by Yellow Taxis, Green Taxis, For-Hire Vehicles, and High Volume For-Hire Vehicles. We rely on the Yellow Taxi data, as this is the transportation method most people use and are familiar with. NYC makes full trip data available starting in 2009, organized by month. Each month contains data on roughly 7 million trips. Given the size of this data, we chose to work only with data from 2019. A significant amount of data is available for each trip, including pickup and dropoff times, pickup and dropoff locations, rate code, payment type, fare amount, credit card tips, and total amount.
For this project, Jae and Andrew received permission from Professor Brambor to work in a group of two. The two of us had been planning a project using traffic data from New York City since before this semester began. We also plan to expand the scope of this project during the summer by implementing machine learning algorithms to build predictive models for this dataset. Given these factors, and our shared interest in this topic, we thought working in a group of two would be most effective.
Our website is split into two major portions. In the first part, we used several ggplot charts to highlight patterns in trip duration. These are time series built from aggregated data, with the configuration tailored to each plot. In the second part, we focused on the variables best suited to being shown on maps. These are leaflet maps built from aggregated data, with custom features added.
Please keep in mind that all the graphs are interactive, except for those in the exploratory data analysis section. Instead of focusing on a specific variable, our group thought it would be more useful to let users explore the visualizations according to their individual needs, whether from the perspective of a consumer or of a taxi driver. Moreover, for the interactive leaflet maps, one needs not only to hover over the area of interest but also to click on it to display more detailed information about the map's variables for that area.
Our team would also like to point out that the majority of the preprocessing and data scrubbing, as well as the exploratory data analysis, are included in our process book and are excluded from this section. This website only includes our final data visualization outputs, in accordance with the instructions laid out on the class website.
When aggregating data, our team used the median for all variables apart from tip amounts. This was done because the data contained enough outliers that they might skew the mean during aggregation. As such, we determined that the median most accurately captures the typical value of each variable of interest.
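The effect of outliers on the mean versus the median can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the team's R code; the duration values are hypothetical and do not come from the TLC data.

```python
from statistics import mean, median

# Hypothetical trip durations in minutes for one taxi zone (assumed values).
# The 180-minute entry stands in for an outlier, e.g. a meter left running.
durations = [8, 9, 10, 11, 12, 180]

avg = mean(durations)    # pulled upward by the single outlier
mid = median(durations)  # stays close to the typical trip

print(f"mean = {avg:.1f} min, median = {mid:.1f} min")
```

With one extreme value among six trips, the mean lands far above every ordinary trip, while the median stays within the range of the typical ones.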
Please find the live links to the process book and our presentation below. The original PowerPoint slides and the R Markdown file for the process book can also be found in the public repository for our project. Note that the YouTube link cannot be opened within the RPubs website; it has to be copied and pasted into a separate tab or window because of security restrictions imposed by YouTube.
Process Book: https://rpubs.com/jaeham/groupw_process_book
Brief Presentation: https://youtu.be/4u6H_43spF4
Before diving into our final design, our group first wanted to explore and understand the variables. We began with a univariate analysis of each variable of interest, then concluded with a bivariate analysis to explore the relationships between them.
As you can see, trip duration appears to follow a right-skewed distribution, with most trips lasting around 5-20 minutes. This makes sense in real life, as most people take yellow cabs for short trips within the city.
Trip distance also follows a right-skewed distribution, although the skew is not as distinct as it was for trip duration. This is not surprising, as trip distance should be closely correlated with trip duration. Most trips seem to fall between 0.5 and 2.5 miles.
The fare amount shows a more uniform distribution than the last two charts. The most frequent fares are centered around 9-12 dollars. However, the distribution is still slightly skewed to the right, with more trips falling in the more expensive range of 13.5 to 30 dollars than in the cheaper range below 9 dollars.
In the multiplots of the three variables of interest, all pairs show a positive relationship, as expected. This is not surprising: the greater the distance, the longer one can expect the trip to last, and vice versa. The same can be said of the fare in relation to either the duration or the distance of the trip. It is important to note, however, that the relationship between fare and duration or distance is steeper than that between distance and duration.
Using ggplot, our group first decided to focus on trip duration. We think there is an interesting story to be told from this variable. From a consumer's perspective, it is useful to know at what hour of the day, on what day of the week, or in what month of the year trips are shortest. Likewise, it is useful for a cab driver to know when trips are longest or shortest. Even though the same approach could have been applied to trip distance and fare amount, we thought these variables are best shown on maps instead of graphs, as presented in our next section.
From the graph above, the peak rush hour period runs from 9 a.m. to 6 p.m., with the median duration hovering around 12 minutes, and the duration gradually declines after 6 p.m. The lowest points are around 5 a.m. or 6 a.m. This is reasonable considering that most people work from 9 a.m. to 6 p.m., and people frequently use cabs throughout the working day in the bustling city of New York.
Our group further dissected the duration over a day by comparing its trends by day of the week. One can clearly see that weekdays share a similar trend: a sharp increase in trip duration from 7 a.m. to 12 p.m., followed by a gradual decline. This is similar to the aggregated graph above, but the effect is more pronounced and the rush hour seems to start earlier. However, there is a clear difference on weekends: Saturday and Sunday both show a gradual increase from 7 a.m. to 7 p.m. (Saturday) or 3 p.m. (Sunday). As such, one can clearly see that people start their day more slowly on these days.
Note: you can isolate a single day by double-clicking on the desired day of the week in the legend box of the interactive graph. This applies to any interactive graph with a legend embedded in it.
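The kind of aggregation behind an hour-of-day chart like this can be sketched as follows. This is a hedged illustration in Python rather than the project's R code: the timestamp format, the sample trips, and the helper name `median_duration_by_hour` are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical (pickup, dropoff) timestamp pairs; real TLC records
# carry many more fields and millions of rows per month.
trips = [
    ("2019-03-04 05:10:00", "2019-03-04 05:16:00"),
    ("2019-03-04 05:40:00", "2019-03-04 05:48:00"),
    ("2019-03-04 09:05:00", "2019-03-04 09:17:00"),
    ("2019-03-04 09:30:00", "2019-03-04 09:44:00"),
    ("2019-03-04 09:50:00", "2019-03-04 10:00:00"),
]

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def median_duration_by_hour(records):
    """Return {pickup hour: median trip duration in minutes}."""
    by_hour = defaultdict(list)
    for pickup, dropoff in records:
        start = datetime.strptime(pickup, FMT)
        end = datetime.strptime(dropoff, FMT)
        # Duration is derived from the two timestamps, in minutes.
        by_hour[start.hour].append((end - start).total_seconds() / 60)
    return {hour: median(minutes) for hour, minutes in sorted(by_hour.items())}


hourly = median_duration_by_hour(trips)
print(hourly)
```

Grouping by `start.weekday()` or `start.month` instead of `start.hour` would give the day-of-week and month views in the same way.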
Next, we shifted gears to study how the duration varies by day of the week. The result almost resembles a normal distribution curve, with Monday and Sunday having the two lowest durations (10.5 and 10 minutes), while Thursday marks the peak at around 12 minutes. This was also reflected in the previous graph when comparing day to day, with Thursday having the highest duration on average and Sunday the lowest.
Our group thought it would also be interesting to analyze the duration by day of the week for each month, and to compare month to month. In general, June and October seem to share the highest peaks across the days of the week. This may be attributed to major holidays (e.g., summer break) or the start of school (e.g., the fall semester for universities), when people tend to travel more than in other months of the year. In terms of day to day, Thursday seems to have the highest trip duration of any day of the week, and Sunday the lowest. This pattern is consistent with what we saw in the graph above.
The following maps present different aspects of Yellow Taxi activity within New York City. These maps are divided into Taxi Zones defined by the city. Map visualizations have the advantage of clearly highlighting features by geography. The first set of maps examines the tipping behavior of passengers, organized by pickup zone. The next set looks more broadly at NYC traffic patterns, analyzing which zones were more congested. The final map examines taxi activity as a whole.
Because a Shiny app cannot be embedded in a static R Markdown HTML page, we published the R Shiny app online at the following link: